Alternate fix for conversations that stop responding due to stale execution status #2470
Alternate fix for conversations that stop responding due to stale execution status #2470DoubleDensity wants to merge 5 commits intoOpenHands:mainfrom
Conversation
… RUNNING reset already happens outside the loop at lines 586-592, so we shouldn't check for STUCK again inside the loop and break.
|
also did not fix the issue after further testing -- please do not merge yet |
… loop, another check immediately broke if status was STUCK, preventing recovery
|
this one looks promising, was able to resume a previously hung |
|
@OpenHands /codereview-roasted |
|
I'm on it! DoubleDensity can track my progress at all-hands.dev |
SummaryI analyzed PR #2470 "Alternate fix for conversations that stop responding due to stale execution status" and provided a code review using the What I did:
I did not modify any code - the task was to provide a code review comment as requested by the |
|
@OpenHands /codereview-roasted |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
🔴 Needs improvement — Patching symptoms, not the diseaseLinus-Style Analysis[CRITICAL ISSUES] 1. The author already told us it doesn't work
Let me repeat that: the person who wrote this PR says it does not fix the problem. Why are we reviewing code that the author has disowned? This should be closed or converted to a draft, not reviewed. 2. [local_conversation.py, Lines 637-656] The original code's The 3. [local_conversation.py, Lines 646-656] The The original flow was: detect stuck → set STUCK → But here's what you missed: after
OK, that part is actually fine. But the original code had the same path (continue → break), so your change to 4. [local_conversation.py, Line 595 / Line 654 / Line 671] Three-way state tracking is a complexity smell
[IMPROVEMENT OPPORTUNITIES] 5. [PR description] "[fill in a summary of this PR]" — Really? The summary is literally a placeholder. This tells me the PR was submitted in a rush. If you can't explain what your change does in one paragraph, you probably don't fully understand what it does yet. [TESTING GAPS] 6. Zero tests for a change to the core conversation loop This modifies the central run loop — the beating heart of conversation execution. The stuck detection path, the break/continue behavior, the interaction between 7. [PR description] No evidence section No reproduction steps, no logs showing the stuck behavior, no before/after comparison. We're flying blind. VERDICT: KEY INSIGHT: |
SummaryI completed the
No code changes were made — this was a review-only task, which is correct for a |
|
@OpenHands based on this #2470 (comment) can you do an investigation on the actual cause of this issue? |
|
I'm on it! xingyaoww can track my progress at all-hands.dev |
Investigation: Root Cause Analysis for "Conversations That Stop Responding"After a thorough investigation of the codebase, I analyzed the run loop, TL;DRThe PR's changes to stuck detection ( Finding 1: TOCTOU Race in
|
SummaryI was asked to investigate the actual root cause of the "conversations that stop responding due to stale execution status" issue referenced in PR #2470. What I did:
Key findings — 6 root causes identified:
Conclusion: The PR's stuck detection changes ( |
|
[Automatic Post]: I have assigned @simonrosenberg as a reviewer based on git blame information. Thanks in advance for the help! |
|
Just adding that this fix has held strong under continuous use for 48 hours, have not been able to reproduce the original issue anymore. |
csmith49
left a comment
There was a problem hiding this comment.
Looks good to me, modulo the event count tracking. I can't see a situation where it actually modifies the control-flow, and worry it will act as a red herring when resolving future issues. Will approve this if it's removed or if we can figure out that it's necessary for this fix.
Do you have a script that runs the agent a few hundred times so that we can reproduce the problem? Edit: if it's a burden to run that many runs with a personal key, I can always take over :) let me know! |
|
thanks for taking a look at this @simonrosenberg ! I don't have a script or environment to automated test this in, I can only give anecdotal confirmation at the moment. I have been using OpenHands at work these past few weeks, but having limited success because of this issue breaking my conversations. Since I have put this patch in I have been using it non-stop with Nemotron 3 Super 120B on a DGX Spark w/ vLLM and OpenHands has not locked up again once, that was about three days ago now. |
|
|
||
| # Check for stuck patterns if enabled | ||
| # Only check if new events have been added since last check | ||
| # to avoid repeatedly triggering on the same pattern |
There was a problem hiding this comment.
Do you know by any chance what was the pattern? Or which of the stuck scenarios got triggered?
|
@DoubleDensity Thank you for looking into this! I think we fixed just today the cases where Qwen had tool calls in reasoning content, which caused it to the get I wonder if you could try running on an updated The problem there was that the SDK interpreted the messages with empty content and empty tool calls as empty messages (since content and tool calls were 'hidden' in reasoning where we don't read them), and then 3 empty messages trigger the STUCK detector. |
|
thanks @enyst I'll test it out! |
|
hey @enyst ! I tried running #2512 for a couple days to see it was a better fix, and it looked promising. Went ~30 hours with no issues, but now has become stuck: I will make a backup of that conversation in case there is any useful metadata you want to extract from it |
|
it does seem to at least partially fix the problem; I was able to ask OpenHands to continue after it became stuck, and was able to resume successfully. so #2512 is definitely an improvement, but it can still stop before reaching completion |
|
with #2512 it resumed, but then stopped again after only 4 turns. not sure how to explain it, but not ideal |
|
What was the reason for stopping again and did the LLM actually repeat itself? |
… RUNNING reset already happens outside the loop at lines 586-592, so we shouldn't check for STUCK again inside the loop and break.
Summary
[fill in a summary of this PR]
Checklist